
BUSINESS PROBLEM:¶

B-76 Technologies, a provider of high-end software services to customers in a variety of sectors, stores and manages confidential client data. Due to an upsurge in cyber attacks, B-76 Technologies is concerned about its network security. To address this problem, the company wants to develop a "Network Intrusion Detection System (NIDS)" that monitors network traffic for unusual activity and raises alerts when such activity is discovered. A NIDS is essential for recognizing and preventing cyber attacks, safeguarding data integrity and confidentiality.

As an experienced Data Scientist appointed by the company, my objective is to develop a NIDS using machine learning algorithms that can detect and prevent network intrusions. The steps involved in building the NIDS are:

  1. Importing the necessary Libraries
  2. Data Extraction
  3. Data Exploration
  4. Data Preprocessing
  5. Feature Engineering
  6. Model Assessment (Training and Testing the Model)

The dataset required for developing the NIDS was gathered from Kaggle. The following is the hyperlink to the dataset:

https://www.kaggle.com/datasets/mostafaomar2372/nf-unsw-nb15

In [1]:
!pip install imblearn
!pip install category_encoders
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (from imblearn) (0.10.1)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.22.4)
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.10.1)
Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.2.2)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (3.1.0)
Installing collected packages: imblearn
Successfully installed imblearn-0.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting category_encoders
  Downloading category_encoders-2.6.1-py2.py3-none-any.whl (81 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.9/81.9 kB 8.5 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.22.4)
Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.2.2)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.10.1)
Requirement already satisfied: statsmodels>=0.9.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (0.13.5)
Requirement already satisfied: pandas>=1.0.5 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.5.3)
Requirement already satisfied: patsy>=0.5.1 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (0.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2022.7.1)
Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from patsy>=0.5.1->category_encoders) (1.16.0)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->category_encoders) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->category_encoders) (3.1.0)
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from statsmodels>=0.9.0->category_encoders) (23.1)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.1

IMPORTING THE LIBRARIES:¶

To access the pre-existing modules and packages used throughout this notebook, the libraries must first be imported into the programming environment.

In [2]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
import sklearn.preprocessing
import category_encoders
import imblearn.over_sampling
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import sklearn.metrics
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
import plotly.graph_objects as go
from plotly.subplots import make_subplots

DATA EXTRACTION:¶

The data required for the NIDS is loaded and stored as a DataFrame for analysis. The head() function previews the first five rows of the dataset, providing an overview of its structure.

In [3]:
unsw_df = pd.read_csv("NF-UNSW-NB15.csv")
unsw_df.head()
Out[3]:
Unnamed: 0 IPV4_SRC_ADDR L4_SRC_PORT IPV4_DST_ADDR L4_DST_PORT PROTOCOL L7_PROTO IN_BYTES IN_PKTS OUT_BYTES ... TCP_WIN_MAX_IN TCP_WIN_MAX_OUT ICMP_TYPE ICMP_IPV4_TYPE DNS_QUERY_ID DNS_QUERY_TYPE DNS_TTL_ANSWER FTP_COMMAND_RET_CODE Label Attack
0 975176 59.166.0.9 30659 149.171.126.5 53 17 0.0 146 2 178 ... 0 0 0 0 53862 1 60 0.0 0 Benign
1 1475060 59.166.0.2 41056 149.171.126.3 64665 6 0.0 320 6 1902 ... 7240 5792 0 0 0 0 0 0.0 0 Benign
2 2149826 59.166.0.7 1867 149.171.126.9 53 17 0.0 146 2 178 ... 0 0 0 0 41710 1 60 0.0 0 Benign
3 931632 59.166.0.6 1235 149.171.126.5 31940 6 0.0 2230 34 15258 ... 20272 14480 6912 27 0 0 0 0.0 0 Benign
4 1614143 59.166.0.0 26575 149.171.126.2 21 6 1.0 2059 37 2816 ... 21720 18824 63744 249 0 0 0 125.0 0 Benign

5 rows × 46 columns

Dataset Overview:¶

The UNSW-NB15 dataset, developed by the University of New South Wales, is a comprehensive dataset designed for Network Intrusion Detection Systems. It comprises network packets captured with the IXIA PerfectStorm tool in the UNSW Canberra Cyber Range Lab, combining real modern normal activity with synthetic contemporary attack behaviours. The sample used here consists of 25,500 instances and 46 columns, covering a diverse range of features extracted from network traffic data. Below is the list of features, encompassing source and destination IP addresses, port numbers, flow duration and various other attributes.

In [4]:
df = pd.read_excel("NF-UNSWNB15-Features.xlsx")
df.head(45)
Out[4]:
FEATURES DESCRIPTION
0 IPV4_SRC_ADDR IPv4 source address
1 IPV4_DST_ADDR IPv4 destination address
2 L4_SRC_PORT IPv4 source port number
3 L4_DST_PORT IPv4 destination port number
4 PROTOCOL IP protocol identifier byte
5 L7_PROTO Layer 7 protocol (numeric)
6 IN_BYTES Incoming number of bytes
7 OUT_BYTES Outgoing number of bytes
8 IN_PKTS Incoming number of packets
9 OUT_PKTS Outgoing number of packets
10 FLOW_DURATION_MILLISECONDS Flow duration in milliseconds
11 TCP_FLAGS Cumulative of all TCP flags
12 CLIENT_TCP_FLAGS Cumulative of all client TCP flags
13 SERVER_TCP_FLAGS Cumulative of all server TCP flags
14 DURATION_IN Client to Server stream duration (msec)
15 DURATION_OUT Server to Client stream duration (msec)
16 MIN_TTL Min flow TTL
17 MAX_TTL Max flow TTL
18 LONGEST_FLOW_PKT Longest packet (bytes) of the flow
19 SHORTEST_FLOW_PKT Shortest packet (bytes) of the flow
20 MIN_IP_PKT_LEN Len of the smallest flow IP packet observed
21 MAX_IP_PKT_LEN Len of the largest flow IP packet observed
22 SRC_TO_DST_SECOND_BYTES Src to dst Bytes/sec
23 DST_TO_SRC_SECOND_BYTES Dst to src Bytes/sec
24 RETRANSMITTED_IN_BYTES Number of retransmitted TCP flow bytes (src->dst)
25 RETRANSMITTED_IN_PKTS Number of retransmitted TCP flow packets (src->dst)
26 RETRANSMITTED_OUT_BYTES Number of retransmitted TCP flow bytes (dst->src)
27 RETRANSMITTED_OUT_PKTS Number of retransmitted TCP flow packets (dst->src)
28 SRC_TO_DST_AVG_THROUGHPUT Src to dst average thpt (bps)
29 DST_TO_SRC_AVG_THROUGHPUT Dst to src average thpt (bps)
30 NUM_PKTS_UP_TO_128_BYTES Packets whose IP size <= 128
31 NUM_PKTS_128_TO_256_BYTES Packets whose IP size > 128 and <= 256
32 NUM_PKTS_256_TO_512_BYTES Packets whose IP size > 256 and <= 512
33 NUM_PKTS_512_TO_1024_BYTES Packets whose IP size > 512 and <= 1024
34 NUM_PKTS_1024_TO_1514_BYTES Packets whose IP size > 1024 and <= 1514
35 TCP_WIN_MAX_IN Max TCP Window (src->dst)
36 TCP_WIN_MAX_OUT Max TCP Window (dst->src)
37 ICMP_TYPE ICMP Type * 256 + ICMP code
38 ICMP_IPV4_TYPE ICMP Type
39 DNS_QUERY_ID DNS query transaction Id
40 DNS_QUERY_TYPE DNS query type (e.g. 1=A, 2=NS..)
41 DNS_TTL_ANSWER TTL of the first A record (if any)
42 FTP_COMMAND_RET_CODE FTP client command return code
43 LABEL Indicates the network traffic: 0 for Normal, 1 for Attack
44 ATTACK This column represents the type of "Attack"

EXPLORING THE DATA:¶

Data exploration is a crucial step in machine learning that helps us analyze the data and gain valuable insights from it. It provides an overview of the data structure, including column names, data types and missing-value counts.

In [5]:
unsw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25500 entries, 0 to 25499
Data columns (total 46 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   25500 non-null  int64  
 1   IPV4_SRC_ADDR                25500 non-null  object 
 2   L4_SRC_PORT                  25500 non-null  int64  
 3   IPV4_DST_ADDR                25500 non-null  object 
 4   L4_DST_PORT                  25500 non-null  int64  
 5   PROTOCOL                     25500 non-null  int64  
 6   L7_PROTO                     25500 non-null  float64
 7   IN_BYTES                     25500 non-null  int64  
 8   IN_PKTS                      25500 non-null  int64  
 9   OUT_BYTES                    25500 non-null  int64  
 10  OUT_PKTS                     25500 non-null  int64  
 11  TCP_FLAGS                    25500 non-null  int64  
 12  CLIENT_TCP_FLAGS             25500 non-null  int64  
 13  SERVER_TCP_FLAGS             25500 non-null  int64  
 14  FLOW_DURATION_MILLISECONDS   25500 non-null  int64  
 15  DURATION_IN                  25500 non-null  int64  
 16  DURATION_OUT                 25500 non-null  int64  
 17  MIN_TTL                      25500 non-null  int64  
 18  MAX_TTL                      25500 non-null  int64  
 19  LONGEST_FLOW_PKT             25500 non-null  int64  
 20  SHORTEST_FLOW_PKT            25500 non-null  int64  
 21  MIN_IP_PKT_LEN               25500 non-null  int64  
 22  MAX_IP_PKT_LEN               25500 non-null  int64  
 23  SRC_TO_DST_SECOND_BYTES      25500 non-null  float64
 24  DST_TO_SRC_SECOND_BYTES      25500 non-null  float64
 25  RETRANSMITTED_IN_BYTES       25500 non-null  int64  
 26  RETRANSMITTED_IN_PKTS        25500 non-null  int64  
 27  RETRANSMITTED_OUT_BYTES      25500 non-null  int64  
 28  RETRANSMITTED_OUT_PKTS       25500 non-null  int64  
 29  SRC_TO_DST_AVG_THROUGHPUT    25500 non-null  int64  
 30  DST_TO_SRC_AVG_THROUGHPUT    25500 non-null  int64  
 31  NUM_PKTS_UP_TO_128_BYTES     25500 non-null  int64  
 32  NUM_PKTS_128_TO_256_BYTES    25500 non-null  int64  
 33  NUM_PKTS_256_TO_512_BYTES    25500 non-null  int64  
 34  NUM_PKTS_512_TO_1024_BYTES   25500 non-null  int64  
 35  NUM_PKTS_1024_TO_1514_BYTES  25500 non-null  int64  
 36  TCP_WIN_MAX_IN               25500 non-null  int64  
 37  TCP_WIN_MAX_OUT              25500 non-null  int64  
 38  ICMP_TYPE                    25500 non-null  int64  
 39  ICMP_IPV4_TYPE               25500 non-null  int64  
 40  DNS_QUERY_ID                 25500 non-null  int64  
 41  DNS_QUERY_TYPE               25500 non-null  int64  
 42  DNS_TTL_ANSWER               25500 non-null  int64  
 43  FTP_COMMAND_RET_CODE         25500 non-null  float64
 44  Label                        25500 non-null  int64  
 45  Attack                       25500 non-null  object 
dtypes: float64(4), int64(39), object(3)
memory usage: 8.9+ MB

Dropping unwanted columns and converting the IPV4_SRC_ADDR & IPV4_DST_ADDR columns to integers

The "Unnamed: 0" column is a leftover index and not important for the analysis, so it is dropped from the dataframe. Exploration also showed that the IPV4_SRC_ADDR and IPV4_DST_ADDR columns are stored as strings; in the following step they are converted to integers by removing the dots.

In [6]:
unsw_df.drop(["Unnamed: 0"], axis =1, inplace = True)
unsw_df["IPV4_SRC_ADDR"] = unsw_df["IPV4_SRC_ADDR"].str.replace(".", "", regex=False).astype(int)
unsw_df["IPV4_DST_ADDR"] = unsw_df["IPV4_DST_ADDR"].str.replace(".", "", regex=False).astype(int)
unsw_df.head()
Out[6]:
IPV4_SRC_ADDR L4_SRC_PORT IPV4_DST_ADDR L4_DST_PORT PROTOCOL L7_PROTO IN_BYTES IN_PKTS OUT_BYTES OUT_PKTS ... TCP_WIN_MAX_IN TCP_WIN_MAX_OUT ICMP_TYPE ICMP_IPV4_TYPE DNS_QUERY_ID DNS_QUERY_TYPE DNS_TTL_ANSWER FTP_COMMAND_RET_CODE Label Attack
0 5916609 30659 1491711265 53 17 0.0 146 2 178 2 ... 0 0 0 0 53862 1 60 0.0 0 Benign
1 5916602 41056 1491711263 64665 6 0.0 320 6 1902 8 ... 7240 5792 0 0 0 0 0 0.0 0 Benign
2 5916607 1867 1491711269 53 17 0.0 146 2 178 2 ... 0 0 0 0 41710 1 60 0.0 0 Benign
3 5916606 1235 1491711265 31940 6 0.0 2230 34 15258 36 ... 20272 14480 6912 27 0 0 0 0.0 0 Benign
4 5916600 26575 1491711262 21 6 1.0 2059 37 2816 39 ... 21720 18824 63744 249 0 0 0 125.0 0 Benign

5 rows × 45 columns
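A caveat worth noting: simply deleting the dots is lossy, because two different addresses can collapse to the same integer (for example, "59.166.0.9" and "59.16.60.9" both become 5916609). A collision-free alternative, sketched below and not part of the original pipeline, maps each address to its unique 32-bit value with the standard-library ipaddress module:

```python
import ipaddress

def ip_to_int(addr: str) -> int:
    """Map a dotted-quad IPv4 address to its unique 32-bit integer value."""
    return int(ipaddress.IPv4Address(addr))

# Dot-stripping collides: "59.166.0.9" and "59.16.60.9" both give "5916609".
# The 32-bit mapping keeps them distinct:
print(ip_to_int("59.166.0.9"))   # 1000734729
print(ip_to_int("59.16.60.9"))   # 990919689
```

Applied via `unsw_df["IPV4_SRC_ADDR"].map(ip_to_int)`, this would preserve uniqueness, though for tree-based models the integer encoding of an IP is a nominal identifier either way.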

While examining the dataset, it was observed that the rows appear grouped by label, with the instances labelled 0 listed before those labelled 1. This ordering may introduce bias into the training and evaluation process, so the dataset is shuffled to distribute the two classes evenly across the rows.

In [7]:
unsw_df = unsw_df.sample(frac=1).reset_index(drop= True)
unsw_df.head(15)
Out[7]:
IPV4_SRC_ADDR L4_SRC_PORT IPV4_DST_ADDR L4_DST_PORT PROTOCOL L7_PROTO IN_BYTES IN_PKTS OUT_BYTES OUT_PKTS ... TCP_WIN_MAX_IN TCP_WIN_MAX_OUT ICMP_TYPE ICMP_IPV4_TYPE DNS_QUERY_ID DNS_QUERY_TYPE DNS_TTL_ANSWER FTP_COMMAND_RET_CODE Label Attack
0 5916601 54210 1491711269 21 6 1.0 2059 37 2814 39 ... 21720 18824 63744 249 0 0 0 125.0 0 Benign
1 5916600 56539 1491711265 53 17 0.0 146 2 178 2 ... 0 0 0 0 31102 1 60 0.0 0 Benign
2 5916609 59133 1491711269 80 6 7.0 1044 8 824 10 ... 7240 7240 27136 106 0 0 0 0.0 0 Benign
3 5916608 26923 1491711265 53 17 0.0 146 2 178 2 ... 0 0 0 0 52366 1 60 0.0 0 Benign
4 5916608 53655 1491711264 21 6 1.0 481 9 750 11 ... 10136 10136 33792 132 0 0 0 229.0 0 Benign
5 5916604 57919 1491711265 5190 6 0.0 1470 22 1728 14 ... 10136 11584 28416 111 0 0 0 0.0 0 Benign
6 5916609 61832 1491711263 18099 6 0.0 2766 44 27770 46 ... 27512 14480 8960 35 0 0 0 0.0 0 Benign
7 175451763 35333 14917112611 80 6 7.0 1270 12 5082 12 ... 16383 16383 51968 203 0 0 0 0.0 0 Benign
8 5916605 37915 1491711267 21 6 1.0 1817 33 2510 35 ... 20272 17376 46080 180 0 0 0 229.0 0 Benign
9 5916603 57728 1491711268 21 6 1.0 481 9 750 11 ... 10136 10136 33792 132 0 0 0 229.0 0 Benign
10 5916604 30017 1491711269 53 17 0.0 146 2 178 2 ... 0 0 0 0 53476 1 60 0.0 0 Benign
11 5916607 17767 1491711268 15938 6 0.0 3302 54 35400 56 ... 34752 14480 11008 43 0 0 0 0.0 0 Benign
12 5916606 3613 1491711263 54195 6 0.0 3926 66 56022 68 ... 43440 14480 11008 43 0 0 0 0.0 0 Benign
13 5916603 7406 1491711266 2413 6 0.0 2854 46 29168 48 ... 28960 14480 6912 27 0 0 0 0.0 0 Benign
14 5916603 19508 1491711269 17087 6 0.0 4014 68 61374 70 ... 44888 14480 8960 35 0 0 0 0.0 0 Benign

15 rows × 45 columns

Checking for missing values in the dataset:

In [8]:
unsw_df.isna().sum()
Out[8]:
IPV4_SRC_ADDR                  0
L4_SRC_PORT                    0
IPV4_DST_ADDR                  0
L4_DST_PORT                    0
PROTOCOL                       0
L7_PROTO                       0
IN_BYTES                       0
IN_PKTS                        0
OUT_BYTES                      0
OUT_PKTS                       0
TCP_FLAGS                      0
CLIENT_TCP_FLAGS               0
SERVER_TCP_FLAGS               0
FLOW_DURATION_MILLISECONDS     0
DURATION_IN                    0
DURATION_OUT                   0
MIN_TTL                        0
MAX_TTL                        0
LONGEST_FLOW_PKT               0
SHORTEST_FLOW_PKT              0
MIN_IP_PKT_LEN                 0
MAX_IP_PKT_LEN                 0
SRC_TO_DST_SECOND_BYTES        0
DST_TO_SRC_SECOND_BYTES        0
RETRANSMITTED_IN_BYTES         0
RETRANSMITTED_IN_PKTS          0
RETRANSMITTED_OUT_BYTES        0
RETRANSMITTED_OUT_PKTS         0
SRC_TO_DST_AVG_THROUGHPUT      0
DST_TO_SRC_AVG_THROUGHPUT      0
NUM_PKTS_UP_TO_128_BYTES       0
NUM_PKTS_128_TO_256_BYTES      0
NUM_PKTS_256_TO_512_BYTES      0
NUM_PKTS_512_TO_1024_BYTES     0
NUM_PKTS_1024_TO_1514_BYTES    0
TCP_WIN_MAX_IN                 0
TCP_WIN_MAX_OUT                0
ICMP_TYPE                      0
ICMP_IPV4_TYPE                 0
DNS_QUERY_ID                   0
DNS_QUERY_TYPE                 0
DNS_TTL_ANSWER                 0
FTP_COMMAND_RET_CODE           0
Label                          0
Attack                         0
dtype: int64

Checking the Label Distribution:

In [9]:
unsw_df.Label.value_counts()
Out[9]:
0    24500
1     1000
Name: Label, dtype: int64

Data Visualization:

The bar graph below shows the distribution of attack types and helps identify the most common attack type in the network.

In [10]:
plt.figure(figsize = (10,5))
ax = unsw_df.Attack.value_counts().plot(kind = "bar")
plt.title("Attack Types")
plt.xlabel("Attack Type")
plt.ylabel("Frequency")

for i,count in enumerate(unsw_df.Attack.value_counts()):
    ax.text(i, count+1.0, str(count), ha = "center", va = "bottom")
plt.show()

SPLITTING THE DATASET¶

To perform model training and evaluation, the dataset is split into training and testing subsets: 80% of the data is used for training and 20% for testing. The training set lets a model learn the patterns and relationships between the features and the label, while the testing set measures how well the model predicts labels on unseen data.

In [11]:
X = unsw_df.drop(["Label"], axis = 1)
Y = unsw_df.Label
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X,Y, test_size = 0.2)

print("The Shape of unsw_df is ", unsw_df.shape)
print("The Shape of X_train is ", X_train.shape)
print("The Shape of X_test is ", X_test.shape)
print("The Shape of Y_train is ", Y_train.shape)
print("The Shape of Y_test is ", Y_test.shape)
The Shape of unsw_df is  (25500, 45)
The Shape of X_train is  (20400, 44)
The Shape of X_test is  (5100, 44)
The Shape of Y_train is  (20400,)
The Shape of Y_test is  (5100,)

Finding correlation between features

The importance of using a heatmap in NIDS lies in its ability to provide a clear picture of the interdependencies between features, which can reflect potential indicators of network intrusions.

In [12]:
plt.figure(figsize = (32,25))
sns.heatmap(unsw_df.corr(numeric_only = True), annot = True, cmap = "viridis")
plt.show()
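Each heatmap cell is the Pearson correlation coefficient that DataFrame.corr computes by default. As a reminder of the underlying quantity, a minimal pure-Python sketch:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product
    of their standard deviations; ranges from -1 to 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0: perfectly correlated
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0: perfectly anti-correlated
```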

DATA ENCODING¶

To make the analysis easier, the remaining categorical features are converted to numerical features using binary encoding. This can enhance the performance of the models.

In [13]:
import category_encoders
be = category_encoders.binary.BinaryEncoder()
be.fit(X_train)

X_train = be.transform(X_train)
X_test = be.transform(X_test)
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
X_train: (20400, 47)
X_test: (5100, 47)
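The jump from 44 to 47 columns follows from how binary encoding works: each categorical column is replaced by ⌈log₂(n+1)⌉ zero/one columns, where n is the number of distinct categories. Here the Attack column, the only string column remaining, expands to 4 columns (consistent with 10 distinct attack labels), so 44 − 1 + 4 = 47. Below is a toy illustration of the idea, not the library's actual implementation:

```python
import math

def binary_encode(categories):
    """Toy binary encoding: give each distinct category an ordinal index
    (starting at 1) and expand that index into fixed-width binary digits."""
    index = {c: i + 1 for i, c in enumerate(dict.fromkeys(categories))}
    width = max(1, math.ceil(math.log2(len(index) + 1)))
    return {c: [int(b) for b in format(i, f"0{width}b")] for c, i in index.items()}

codes = binary_encode(["Benign", "Exploits", "Fuzzers", "DoS", "Generic"])
print(codes["Benign"])   # [0, 0, 1] -- 5 categories fit in 3 binary columns
```

Compared with one-hot encoding, this keeps the column count logarithmic in the number of categories rather than linear.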

FEATURE SCALING¶

In order to ensure model stability we use feature scaling, a technique that puts all features on a similar scale, preventing some features from dominating others.

In [14]:
feature_scaler = sklearn.preprocessing.StandardScaler(with_mean = False)
feature_scaler.fit(X_train)

X_train= feature_scaler.transform(X_train)

X_test = feature_scaler.transform(X_test)

print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
X_train: (20400, 47)
X_test: (5100, 47)
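With with_mean = False, StandardScaler divides each feature by its standard deviation without subtracting the mean (centering is skipped, which among other things keeps sparse data sparse). A sketch of the transformation for a single column:

```python
import math

def scale_without_centering(column):
    """Divide by the population standard deviation (as sklearn uses by
    default) without subtracting the mean, mirroring
    StandardScaler(with_mean=False) for one column."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [x / std for x in column]

print(scale_without_centering([10.0, 20.0, 30.0]))  # each value / ~8.165
```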

Checking for class imbalance

In [15]:
label_distribution = Y_train.value_counts()

colours = ["red","blue"]
plt.bar(label_distribution.index,label_distribution.values, color = colours)
plt.xlabel("Label")
plt.ylabel("Frequency")
plt.xticks([0,1],["0", "1"])
plt.title("Label Distribution")
for i,count in enumerate(label_distribution):
    plt.text(i, count+1.0, str(count), ha = "center", va = "bottom")

BALANCING CLASSES¶

The plot shows a significant class imbalance in the training data, which may bias a model towards the majority class. To avoid this, the classes are balanced using the SMOTE (Synthetic Minority Over-sampling Technique) technique, applied to the training set only.

In [16]:
over_sampling_sm = imblearn.over_sampling.SMOTE()
X_train,Y_train = over_sampling_sm.fit_resample(X_train,Y_train)

Y_train.value_counts()
Out[16]:
0    19589
1    19589
Name: Label, dtype: int64
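At its core, SMOTE creates each synthetic minority sample by interpolating between a minority point and one of its k nearest minority-class neighbours. A minimal sketch of that interpolation step (the neighbour search is omitted):

```python
import random

def smote_interpolate(point, neighbor, rng=random.Random(42)):
    """One SMOTE step: pick a random gap in [0, 1) and return the sample
    that fraction of the way from `point` towards `neighbor`."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(point, neighbor)]

synthetic = smote_interpolate([1.0, 5.0], [3.0, 9.0])
print(synthetic)  # lies on the line segment between the two inputs
```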

MODEL ASSESSMENT:¶

To assess the performance and effectiveness of the models, a comprehensive evaluation is used to measure their accuracy and reliability. The models selected for training are Random Forest, XGBoost and the MLP classifier, and hyperparameter tuning is performed to enhance their performance. The evaluation incorporates the accuracy score, the classification report and ROC curve analysis. The accuracy score indicates the proportion of correctly classified instances; the classification report provides detailed metrics such as precision, recall and F1-score, showing a model's ability to balance precision and recall for each class; and the ROC curve demonstrates a model's ability to discriminate between the positive and negative classes.
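For reference, the per-class metrics in the classification reports below reduce to simple ratios of confusion-matrix counts. A small sketch with toy counts (not taken from this notebook's results):

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from confusion-matrix counts: true positives,
    false positives and false negatives."""
    precision = tp / (tp + fp)   # of everything flagged, how much was right
    recall = tp / (tp + fn)      # of everything real, how much was caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

print(precision_recall_f1(tp=90, fp=10, fn=30))  # (0.9, 0.75, 0.8181...)
```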

MODEL 1: RANDOM FOREST¶

Random forest has the ability to handle high dimensional data and effectively deals with overfitting. It uses an ensemble of decision trees and incorporates randomness in the feature selection, resulting in accurate predictions.

In [17]:
random_forest = RandomForestClassifier()
random_forest_params = {
    "n_estimators":[10,50,100],
            "max_depth": [None,3,5],
}
random_forest_grid_search = GridSearchCV(random_forest, random_forest_params, cv = 5)
rf = random_forest_grid_search.fit(X_train,Y_train )
random_forest_predicted = random_forest_grid_search.predict(X_test)
random_forest_accuracy_score = sklearn.metrics.accuracy_score(Y_test,random_forest_predicted)
report_random_forest = classification_report(Y_test,random_forest_predicted)
print(report_random_forest)
print("The accuracy score of Random Forest algorithm is ",random_forest_accuracy_score*100,"%")
print('Random Forest best parameters:', random_forest_grid_search.best_params_)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4911
           1       0.99      1.00      0.99       189

    accuracy                           1.00      5100
   macro avg       0.99      1.00      1.00      5100
weighted avg       1.00      1.00      1.00      5100

The accuracy score of Random Forest algorithm is  99.9607843137255 %
Random Forest best parameters: {'max_depth': None, 'n_estimators': 100}

ROC CURVE FOR RANDOM FOREST

In [27]:
from sklearn.metrics import roc_curve, roc_auc_score
rf_prob = rf.predict_proba(X_test)[:,1]
fpr,tpr,thresholds = roc_curve(Y_test,rf_prob )
auc_rf = roc_auc_score(Y_test,rf_prob)
train_accuracy = rf.score(X_train, Y_train)
test_accuracy = rf.score(X_test, Y_test)
plt.plot(fpr,tpr, label = "Random Forest (Training Accuracy: {:,.2F},Testing Accuracy: {:,.2F})".format(train_accuracy,test_accuracy))
plt.plot([0,1], [0,1], linestyle = "--", color = "r", label = "Reference line")
plt.xlabel("FALSE POSITIVE RATE")
plt.ylabel("TRUE POSITIVE RATE")
plt.title("ROC CURVE")
plt.legend(loc = "lower right")
plt.show()
print ("The AUC score of Random forest Classifier is ",auc_rf )
The AUC score of Random forest Classifier is  0.9999983839324096

MODEL 2 : XGBOOST¶

XGBoost is a powerful gradient boosting algorithm that combines multiple weak learners into a strong predictive model. It also has built-in regularization techniques to prevent overfitting.

In [19]:
xgb = XGBClassifier()
xgb_params = {
    "n_estimators":[50,100,150],
    "max_depth":[3,6,9],
    "learning_rate":[0.1,0.01,0.001]
}
xgb_grid_search = GridSearchCV(xgb, xgb_params, cv = 5)
xgb_classifier = xgb_grid_search.fit(X_train,Y_train )
xgb_predicted = xgb_grid_search.predict(X_test)
xgb_accuracy_score = sklearn.metrics.accuracy_score(Y_test,xgb_predicted)
report_xgb = classification_report(Y_test,xgb_predicted)
print(report_xgb)
print("The accuracy score of XGBoost algorithm is ",xgb_accuracy_score*100,"%" )
print('XG Boost best parameters:', xgb_grid_search.best_params_)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4911
           1       1.00      1.00      1.00       189

    accuracy                           1.00      5100
   macro avg       1.00      1.00      1.00      5100
weighted avg       1.00      1.00      1.00      5100

The accuracy score of XGBoost algorithm is  100.0 %
XG Boost best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}

ROC CURVE FOR XGB CLASSIFIER

In [29]:
from sklearn.metrics import roc_curve, roc_auc_score
xgb_prob = xgb_classifier.predict_proba(X_test)[:,1]
fpr,tpr,thresholds = roc_curve(Y_test,xgb_prob )
auc_xgb = roc_auc_score(Y_test,xgb_prob)
train_accuracy = xgb_classifier.score(X_train, Y_train)
test_accuracy = xgb_classifier.score(X_test, Y_test)
plt.plot(fpr,tpr, label = "XGB (Training Accuracy: {:,.2F},Testing Accuracy: {:,.2F})".format(train_accuracy,test_accuracy))
plt.plot([0,1], [0,1], linestyle = "--", color = "r", label = "Reference line")
plt.xlabel("FALSE POSITIVE RATE")
plt.ylabel("TRUE POSITIVE RATE")
plt.title("ROC CURVE")
plt.legend()
plt.show()
print ("The AUC score of XGB Classifier is ",auc_xgb )
The AUC score of XGB Classifier is  1.0

MODEL 3 : MLP CLASSIFIER¶

MLP classifiers can handle complex non-linear relationships between features. The dataset has a wide range of features extracted from network traffic, and the MLP's ability to model and learn such intricate relationships makes it a suitable choice for this task.

In [21]:
mlp = MLPClassifier()
mlp_params = {
    "hidden_layer_sizes": [(10,), (50,), (100,)],
    "activation": ["relu","tanh"],
}
mlp_grid_search = GridSearchCV(mlp, mlp_params, cv = 5)
mlp_classifier = mlp_grid_search.fit(X_train,Y_train )
mlp_predicted = mlp_grid_search.predict(X_test)
mlp_accuracy_score = sklearn.metrics.accuracy_score(Y_test,mlp_predicted)
report_mlp = classification_report(Y_test,mlp_predicted)
print(report_mlp)
print("The accuracy score of MLP classifier is ",mlp_accuracy_score*100,"%")
print('MLP best parameters:', mlp_grid_search.best_params_)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4911
           1       1.00      1.00      1.00       189

    accuracy                           1.00      5100
   macro avg       1.00      1.00      1.00      5100
weighted avg       1.00      1.00      1.00      5100

The accuracy score of MLP classifier is  100.0 %
MLP best parameters: {'activation': 'relu', 'hidden_layer_sizes': (10,)}

ROC CURVE FOR MLP CLASSIFIER

In [31]:
from sklearn.metrics import roc_curve, roc_auc_score
mlp_prob = mlp_classifier.predict_proba(X_test)[:,1]
fpr,tpr,thresholds = roc_curve(Y_test,mlp_prob )
auc_mlp = roc_auc_score(Y_test,mlp_prob)
train_accuracy = mlp_classifier.score(X_train, Y_train)
test_accuracy = mlp_classifier.score(X_test, Y_test)
plt.plot(fpr,tpr, label = "MLP (Training Accuracy: {:,.2F},Testing Accuracy: {:,.2F})".format(train_accuracy,test_accuracy))
plt.plot([0,1], [0,1], linestyle = "--", color = "r", label = "Reference line")
plt.xlabel("FALSE POSITIVE RATE")
plt.ylabel("TRUE POSITIVE RATE")
plt.title("ROC CURVE")
plt.legend()
plt.show()
print ("The AUC score of MLP Classifier is ",auc_mlp )
The AUC score of MLP Classifier is  1.0

The ROC curve analysis of all 3 algorithms yielded near-perfect performance, with AUC scores of (almost exactly) 1, indicating perfect separation of the two classes on this test set. Scores this perfect warrant caution: the binary-encoded Attack column, which directly identifies malicious flows, remains among the input features, so some of this performance may reflect label leakage rather than generalizable detection ability.
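The AUC values reported above follow the rank interpretation of the metric: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. A pure-Python sketch of that computation:

```python
def auc_score(y_true, scores):
    """AUC via the Mann-Whitney formulation: the fraction of
    (positive, negative) pairs where the positive outranks the negative
    (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 1.0 means every positive scored above every negative, which is exactly what the three classifiers achieved here.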

Exploring LSTM for NIDS:¶

LSTM networks are known for their ability to capture long-term dependencies and are widely used in sequential data analysis. However, on this dataset the model tended to overfit and achieved lower accuracy than the three models above: there was a significant gap between training accuracy and validation accuracy, indicating overfitting. A snippet of the code and graph is given below:

[Images 1.png, 2.png: LSTM code snippet and training vs. validation accuracy plot]

Visualizing the Performance of the Models¶

The scatter plots help us visualize how well each model's predictions align with the actual values. The plots below indicate close agreement between the predicted and actual labels.

In [23]:
fig = make_subplots(rows = 3, cols = 1)

fig.add_trace(go.Scatter(x = list(range(len(Y_test))), y = Y_test, mode = "markers", name = "Actual Value"), row = 1,col=1)
fig.add_trace(go.Scatter(x = list(range(len(Y_test))), y = random_forest_predicted, mode = "lines", name = "Random Forest Predicted"), row = 1,col=1)


fig.add_trace(go.Scatter(x = list(range(len(Y_test))), y = Y_test, mode = "markers", name = "Actual Value"), row = 2,col=1)
fig.add_trace(go.Scatter(x = list(range(len(Y_test))), y = xgb_predicted, mode = "lines", name = "XGB Predicted"), row = 2,col=1)

fig.add_trace(go.Scatter(x = list(range(len(Y_test))), y = Y_test, mode = "markers", name = "Actual Value"), row = 3,col=1)
fig.add_trace(go.Scatter(x = list(range(len(Y_test))), y = mlp_predicted, mode = "lines", name = "MLP Predicted"), row = 3,col=1)

fig.update_layout(height = 800, width = 800, title ="Actual vs Predicted values")
fig.update_xaxes(title_text = "No Of Data Points")
fig.update_yaxes(title_text = "Label", row =1, col =1)
fig.update_yaxes(title_text = "Label", row =2, col =1)
fig.update_yaxes(title_text = "Label", row =3, col =1)
fig.show()

Generating Warning:¶

The function "initiate_warning" generates a warning and blocks the source IP address when a network intrusion is detected.

In [24]:
def initiate_warning(ip):
    print(f"WARNING! Network intrusion detected from the source IP address {ip}")
    os.system(f"sudo iptables -A INPUT -s {ip} -j DROP")
    print(f"The IP {ip} has been BLOCKED!")
In [25]:
IPV4_SRC_ADDR = "175.451.760"
initiate_warning(IPV4_SRC_ADDR)
WARNING! Network intrusion detected from the source IP address 175.451.760
The IP 175.451.760 has been BLOCKED!
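A hardening note on the blocking step: interpolating an unvalidated string into an os.system shell command is risky. The sketch below, an illustrative alternative rather than the notebook's implementation, validates the address with the standard-library ipaddress module and builds the command as an argument list suitable for subprocess.run; note that it would reject the malformed example address 175.451.760 used above.

```python
import ipaddress

def build_block_command(ip):
    """Validate `ip` and return the iptables invocation as an argument list,
    so that no shell ever interprets the untrusted string."""
    ipaddress.ip_address(ip)  # raises ValueError for malformed addresses
    return ["sudo", "iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"]

print(build_block_command("192.0.2.7"))
# To actually apply it: subprocess.run(build_block_command(ip), check=True)
```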


DISCUSSION:¶

After a comprehensive analysis of the three algorithms, it is evident that each showcased remarkable performance in the context of NIDS. The accuracy and precision of MLP and XGBoost are slightly higher than those of Random Forest, which, while performing exceptionally well, exhibited slightly lower precision for intrusive activity. MLP and XGBoost achieved perfect precision, recall and F1-score on the test set, indicating that they classified intrusive and non-intrusive activity without error. Each algorithm has its own advantages for NIDS tasks, and all three performed remarkably well on this dataset, offering valuable insights for effective intrusion detection and network security.

LIMITATIONS:¶

Even though the Random Forest, XGBoost and Multilayer Perceptron (MLP) algorithms offer promising results for NIDS, they are not immune to limitations. The limitations include:

  1. Sensitivity to hyperparameters: Getting the best performance from all three algorithms requires careful hyperparameter tuning, which is time consuming and demands domain expertise, making the task challenging.
  2. Handling encrypted traffic: Encryption hides the content of network traffic, making it difficult to detect anomalies and attacks and limiting the effectiveness of the algorithms.
  3. Dynamic threat environment: The models need to be continuously updated and adapted to keep pace with rapidly evolving network-based attacks.
  4. High computational requirements: Training NIDS models such as MLP and XGBoost is computationally expensive and requires large computational resources and bandwidth, potentially impacting network performance.

CONCLUSION:¶

The Network Intrusion Detection System developed for B-76 Technologies has displayed favourable results for accuracy, precision and recall. The successful implementation of this system gives the company an essential tool for protecting its network from unauthorized access and malicious activity: it can detect anomalies and alert administrators to potential security breaches. Thus, the NIDS improves the company's overall security posture by accurately classifying network traffic as normal or malicious. Although NIDS are not infallible and have limitations in detecting advanced attacks, integrating them with additional security measures and continuously fine-tuning them can increase their effectiveness. In conclusion, a NIDS enables the company to identify and mitigate potential threats, ultimately bolstering the network's security infrastructure.

REFERENCES:¶

  1. research.unsw.edu.au. (n.d.). The UNSW-NB15 Dataset | UNSW Research. [online] Available at: https://research.unsw.edu.au/projects/unsw-nb15-dataset.

  2. Ahmad, M., Riaz, Q., Zeeshan, M., Tahir, H., Haider, S.A. and Khan, M.S. (2021). Intrusion detection in internet of things using supervised machine learning based on application and transport layer features using UNSW-NB15 data-set. EURASIP Journal on Wireless Communications and Networking, 2021(1). doi:https://doi.org/10.1186/s13638-021-01893-8.

  3. Subrata Maji (2020). Building an Intrusion Detection System on UNSW-NB15 Dataset Based on Machine Learning Algorithm. [online] Medium. Available at: https://medium.com/@subrata.maji16/building-an-intrusion-detection-system-on-unsw-nb15-dataset-based-on-machine-learning-algorithm-16b1600996f5.

  4. Moustafa, N. and Slay, J. (2015). UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). [online] IEEE Xplore. doi:https://doi.org/10.1109/MilCIS.2015.7348942.

  5. M.S. et al. (2019). Network Based Intrusion Detection Using the UNSW-NB15 Dataset. International Journal of Computing and Digital Systems, [online] 8(5), p.477. Available at: https://www.academia.edu/40842534/Network_Based_Intrusion_Detection_Using_the_UNSW_NB15_Dataset [Accessed 26 Jun. 2023].
